Student Matricola Email
Protani Andrea 1860126
Tromboni Gabriele 2088799
Boesso Simone 1800408

Exercise 1

Stopping Time

Process: suppose that \(X \sim \text{Unif}(0, 1)\), and suppose we independently draw \(\{Y1, Y2, Y3, . . .\}\) from yet another \(\text{Unif}(0, 1)\) model until we reach the random stopping time \(T\) such that \((Y_T < X)\).
Question: it can be shown that the (marginal) PMF of \(T\) is such that Pr\((T = t) = \frac{1}{t(t+1)}\) for \(t \in \{1,2,3,...\}\). Setup a simulation in R that implements the sampling scheme above in order to numerically check this result. Quantitatively check how the simulation size impacts the approximation. Make some suitable plot to support your comments and conclusions.


What was our idea in order to write code as fast as possible?

  • The event \(T\) is {first instant time such that \((Y_T < X)\) happens}, that is the first “trial” at which that condition happens

  • For this very reason, \(T\) conditionally \(X=x\) is distributed as a Geometric Distribution with probability of success as \(x\), namely \((T|X=x) \sim Geom(x)\).

  • A variable with geometric distribution returns the number of times it takes for a certain condition to happen, which in our case is the waiting time \(T\) until \((Y_T<X)\)

So basically our code is based on the idea written above, picking randomly observations from a Geometric Distribution with probability of success equal to a random sample from a \(Unif(0,1)\). This will be the random stopping time that we increment by 1 because we want our counter to start from 1 and not from 0.
For the last M we will use an approach also based on parallelization through the library future.apply which allows us to speed up the process by a few tenths.

Let’s calculate the median time taken for each \(\texttt{M}\) to do the simulation.

library(future.apply, quietly = TRUE)

cores = parallel::detectCores()

for (m in 1:length(M)){
  m1 = M[m]
  for (m2 in 1:N){
    set.seed(13112221)
    if (m == 6){
      plan(multisession)
      beg <- Sys.time()
      new_M = M/cores
      t_data = rep(NA, cores)
      t_data = future_lapply(t_data, future.seed=T, function(m){
        t = rgeom(new_M, runif(new_M)) + 1
        return (t)
      })
      tempi[m2] = Sys.time() - beg
    }
    else {
      beg = Sys.time()
      t_data = rgeom(m1, runif(m1)) + 1
      tempi[m2] = Sys.time() - beg
    }
  }
  tempi_finali[m] = round(median(tempi), 3)
}
tempi_finali
## [1] 0.000 0.000 0.001 0.015 0.139 0.893

Let’s now compare with some plots our result with the theoretical distribution of \(T\) for each \(\texttt{M}\), namely \(\mathbb{P}(T = t) = \frac{1}{t(t+1)}\)

From the plots it can be seen that already with M=1000 the approximation is quite good and from M=10000 the real density and the estimated one practically coincide.

As a last step let’s visualize also how our code scale with the increase of \(\texttt{M}\).

Simulation Size M Time
100 \(\approx\) 0 sec
1.000 \(\approx\) 0 sec
10.000 \(\approx\) 0.002 sec
100.000 \(\approx\) 0.015 sec
1.000.000 \(\approx\) 0.14 sec
10.000.000 \(\approx\) 0.91 sec

To better appreciate the benefit of parallelizing the code we increase the simulation size to M=100.000.000

Let’s first run the base code and check the time that it takes:

## Time difference of 14.862 secs

And now let’s implement the multiprocessing approach:

set.seed(13112221)

new_M = 100000000
cores = parallel::detectCores()

plan(multisession)

beg_mp <- Sys.time()

M = new_M/cores

t_data = rep(NA, cores)
t_data = future_lapply(t_data, future.seed=T, function(m){
  t = rgeom(M, runif(M)) + 1
  return (t)
})

fin_mp <- Sys.time() - beg_mp
round(fin_mp, 3)
## Time difference of 4.951 secs

So these show that with a large \(\texttt{M}\) it is very useful to parallelize the code, in fact in this way is almost \(70\%\) faster.

Exercise 2

  1. Let’s focus on the univariate case with \(d = 1\) so that the the measurement space is the unit interval, \(\mathcal X = [0, 1]\). Assume also that the true density \(p_X (·)\) behind your data \(X\) is \(\underline{\text{known}}\) and equal to a Beta(\(\alpha = 10, \beta = 10\)). In this part of the exercise you have to setup up a simulation to compare the Mean Integrated Squared Error (\(\texttt{MISE}\), see below) between the true model \(p_X (·)\) and its two approximations \(\hat p_{n, m} (·)\) and \(\hat q_{\epsilon, m}(·)\).
    \(\texttt{MISE}(p_X, \hat p_{n, m}) = \mathbb{E}_{p_X} (\int_0^1 (p_X(x) - \hat p_{n, m}(x))^2 \text{d}x)\) = { \(\texttt{MISE}\) between the true model and the original histogram},
    \(\texttt{MISE}(p_X, \hat q_{\epsilon, m}) = \mathbb{E}_{p_X, Q} (\int_0^1 (p_X(x) - \hat q_{\epsilon, m}(x))^2 \text{d}x)\) = { \(\texttt{MISE}\) between the true model and the privatized histogram}.

    It is crucial to understand that here we are dealing with two sources of randomness: 1. the randomness due to iid-sampling from the population model \(P_X(·); 2\). the randomness due to the privacy mechanism $Q(·) $ for us, this is the \(\texttt{IID}\)-sampling from the \(\texttt{Laplace}\). Consequently, for a generic transformation \(r(·)\), the expectation \(\mathbb{E}_{p_X, Q}(·)\) above should be parsed as

    \[ \mathbb{E}_{p_X, Q}(r(Z_1, \cdots, Z_k)) \stackrel{\texttt{LLS}}{=} \int \bigg(\int r(z_1,..., z_k) \text{d}Q(z_1,...,z_k|x_1,...,x_n) \bigg) \text{d}P_X(x_1) \cdots P_X(x_n)\]

    Once this is clear (and you \(\underline{\text{must}}\) ask questions if it’s not!), the following, are the relevant simulation parameters to try:

    • Pick a large enough simulation size M that does not cook your CPU;
    • \(n \in \{100, 1000 \}\);
    • \(\epsilon \in \{0.1, 0.001 \}\);
    • \(m \in \text{grid}([5, 50])\), size the grid wisely: not too coarse to achieve decent resolution, not too fine to save CPU-time.

    We initialize the parameters by choosing M=1000 and m as a sequence from 5 to 50 of step 5.

First MISE: to estimate the first MISE we use the stepfun to approximate the histogram, we create a function to calculate the squared differences, i.e. the argument of the integral and we repeat the sampling and relative calculation of the integral for M times to then calculate the empirical mean.

par(mfrow=c(2, 2))
    for (n_rep in 1:length(n)){
      n1 = n[n_rep]
      for (i in 1:length(m)){
    m1 = m[i]
    for (j in 1:M){
      X = rbeta(n1, 10, 10)
      classes = seq(min(X), max(X), (max(X) - min(X)) / m1)
      p_hat = hist(X, breaks=classes, plot=F)
      step_fun = stepfun(p_hat$breaks, c(0, p_hat$density, 0))
      errors[j] = integrate(square_diff, 0, 1, subdivisions=1000)$value
      if (n1==100 & m1==25 & j==500){
        hist(X, breaks=classes, prob=T, col='cornflowerblue', border='white',
             main='Real density of a Beta(10, 10) vs\nthe histogram of the simulation',
             xlab='', ylab='Density', sub='Example with n=100 and m=25')
        grid()
        box()
        curve(dbeta(x, 10, 10), col=rgb(1,0,0,0.5), add=T, lwd=3)
      }
      if (n1==1000 & m1==25 & j==500){
        hist(X, breaks=classes, prob=T, col='cornflowerblue', border='white',
             main='Real density of a Beta(10, 10) vs\nthe histogram of the simulation',
             xlab='', ylab='Density', sub='Example with n=100 and m=25')
        grid()
        box()
        curve(dbeta(x, 10, 10), col=rgb(1,0,0,0.5), add=T, lwd=3)
      }
    }
    MISE1[n_rep, i] = round(mean(errors), 3)
      } 
    }

    plot(seq(5, 50, 5), MISE1[1,], type='o', pch=16, lwd=2, col='cornflowerblue',
     main='MISE behavior with\nn=100 and different m',
     xlab='m', ylab='MISE')
    grid()
    plot(seq(5, 50, 5), MISE1[2,], type='o', pch=16, lwd=2, col='cornflowerblue',
     main='MISE behavior with\nn=1000 and different m',
     xlab='m', ylab='MISE')
    grid()

Second MISE: first we use the \(\texttt{rlaplace}\) function of the VGAM package to perturb the histograms, at this point we pass from the new counts to the densities which we then pass to stepfun, then we continue as in the first MISE.

par(mfrow=c(2, 2))

for (n_rep in 1:length(n)){
  n1 = n[n_rep]
  for (eps in epsilon){
    for (i in 1:length(m)){
      m1 = m[i]
      for (j in 1:M){
        X = rbeta(n1, 10, 10)
        classes = seq(min(X), max(X), (max(X) - min(X)) / m1)
        p_hat = hist(X, breaks=classes, plot=F)
        nu = rlaplace(m1, 0, sqrt(4/(eps^2)))
        q_hat = p_hat$counts + nu
        q_hat[q_hat < 0] = 0
        bin_width = classes[2] - classes[1]
        q_hat_density = (q_hat/sum(q_hat))*(1/bin_width)
        step_fun = stepfun(p_hat$breaks, c(0, q_hat_density, 0))
        errors[j] = integrate(square_diff, 0, 1, subdivisions=1000, stop.on.error=FALSE)$value
        if (n1==100 & m1==25 & j==500 & eps==0.1){
          plot(step_fun, col='cornflowerblue',
               main='Approximation of the perturbed histogram\n vs the real density',
               lwd=3,
               xlab='x', ylab='Density', sub='Example with n=100, m=25 and epsilon=0.1')
          grid()
          box()
          curve(dbeta(x, 10, 10), col=rgb(1,0,0,0.5), add=T, lwd=3)
        }
        if (n1==1000 & m1==25 & j==500 & eps==0.001){
          plot(step_fun, col='cornflowerblue',
               main='Approximation of the perturbed histogram\n vs the real density',
               lwd=3,
               xlab='x', ylab='Density', sub='Example with n=1000, m=25 and epsilon=0.001')
          grid()
          box()
          curve(dbeta(x, 10, 10), col=rgb(1,0,0,0.5), add=T, lwd=3)
        }
      }
      if (n1 == 100 & eps == 0.1) MISE2$n100eps01[i] = round(mean(errors), 3)
      if (n1 == 100 & eps == 0.001) MISE2$n100eps0001[i] = round(mean(errors), 3)
      if (n1 == 1000 & eps == 0.1) MISE2$n1000eps01[i] = round(mean(errors), 3)
      if (n1 == 1000 & eps == 0.001) MISE2$n1000eps0001[i] = round(mean(errors), 3)
    }  
  }
}

plot(seq(5, 50, 5), MISE2$n100eps01, type='o', pch=16, lwd=2, col='cornflowerblue',
     main='MISE behavior with\nn=100 and different m',
     xlab='m', ylab='MISE', ylim=c(0, 8))
legend('bottomright', c('eps=0.1', 'eps=0.001'), lty=1, lwd=3, col=c('cornflowerblue', 'coral2'))
grid()
lines(seq(5, 50, 5), MISE2$n100eps0001, lwd=2, col='coral2', type='o', pch=16)
plot(seq(5, 50, 5), MISE2$n1000eps01, type='o', pch=16, lwd=2, col='cornflowerblue',
     main='MISE behavior with\nn=1000 and different m',
     xlab='m', ylab='MISE', ylim=c(0, 8))
legend('topleft', c('eps=0.1', 'eps=0.001'), lty=1, lwd=3, col=c('cornflowerblue', 'coral2'))
grid()
lines(seq(5, 50, 5), MISE2$n1000eps0001, lwd=2, col='coral2', type='o', pch=16)

  1. Repeat the exercise above by replacing the single Beta model with a mixture of 2 Beta’s (free to choose their parameters) that must induce some “sparsity” in the resulting histogram \(\hat p_{n,m}(·)\). In pseudo-R notation, pick \[ p_X(x) = \pi \cdot \text{dbeta}(x | \alpha_1, \beta_1) + (1- \pi) \cdot \text{dbeta}(x | \alpha_2, \beta_2)\] where \(\pi \in (0, 1)\) is the probability to pick observations from the first sub-population. Comparatively comment the results you got under these two different population scenarios: is there informational loss? Explain, possibly also evaluating \(\texttt{MISE} (\hat q_{\epsilon, m} , \hat p_{n,m} )\).

To choose the mixture we ‘played’ with manipulate to find a shape of the mixture that satisfied us.
At the end we chose: \(A \sim \text{Beta}(17, 5)\), \(B \sim \text{Beta}(4, 10)\) and \(\pi = 0.3\).
So as a first step we initialize the parameters and we create the following functions:
- \(\texttt{p_X}\): is the density of the mixture
- \(\texttt{rmixbeta}\): it will be used for the sample
- \(\texttt{square_diff}\): the new, slightly modified, square loss

Plots of chosen betas, comparison between marginals and generated mixture

Estimation of the first MISE:

Estimation of the second MISE:

Finally we compute \(\texttt{MISE}(\hat q_{\epsilon,m}, \hat p_{n,m})\), we have to modify the loss function again because in this case it has to calculate the differences between the approximations of two histograms, so it will receive two stepfunctions as input

We can see that in every try the MISE is lower when n=1000 in respect n=100, so, as we expected, the simulation size has a positive effect on the approximation.

One important thing to note is that as \(\epsilon\) decreases, information loss increases.

When n=100 the MISE increase constantly with m and the best choice for m is 5 or at most 10, regardless of simulated and perturbed histograms.

When n=1000 we have two different behaviors:

If we now compare the MISE between the mixed beta model and the single one, we can observe that the behaviors are similar but the mixture seems to have lower errors.

  1. Think hard. Can you figure out a simple/small (\(n < 100\), only one variable \(X\)) data collection you can realize in less than two weeks where privacy is key? Remember, the idea is that you can collect the data, but you do not wanna share them as they are (with me in particular) for further statistical analyses. All right, after the brain-storm, collect the data, privatize them with the perturbed histogram approach, and report your analyses to me by sharing a private dataset \(\mathcal Z_k = \{Z_1,...,Z_k\}\) together with some context (e.g. what was the main goal of the analysis, how you got the data, how did you choose m, how did you choose \(k\) and \(\epsilon\), what happened upon privatization, what are the relevant statistics/summaries I must reproduce from the privatized data, etc).

The main goal of our analysis is to derive statistics from the daily habits of us young people.

We created a Microsoft form with three “simple” questions regarding three main topics of our age:

After receiving a fair number of responses that we could analyze, we noticed that the most satisfying, uniform data and the data in which we were most interested were those concerning sexual habits.

To have an idea of what we are dealing with let’s plot an histogram and a summary

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      -1       1       5    9622      10 1000000

It is clear that there are some outliers that we need to remove…

Now it’s much better so we can move on.

Having received a hundred responses in order to get a fair trade-off between the number of bins and observations made, we thought of dividing our data into \(\texttt{m}=10\) bins so that we would have a satisfying amount of data for each. As for \(k\), on the other hand, which is the size of our privatized dataset through perturbed histogram approach, we chose \(80\), a high value that allows us to maintain the qualities of the informations received without actually showing the latter in its totality. Instead, the choice of \(\epsilon\) value in the Laplace mechanism fell to \(0.1\) in order to perturb the data significantly but without losing too much information, because an even lower value, as could have been \(0.001\), would have made the perturbed dataset too dispersive and misleading compared to the original.

Let’s normalize our data dividing each element for the max and plot a new histogram whit number of bins equal to 10.

Let’s apply the perturbation to our histogram in order to privatize the data. The information we would like to continue to have after perturbing the data are those shown with the summary

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.033   0.167   0.225   0.333   1.000

As a last step we sample \(k\) elements from our perturbed histogram, this sample will be the new dataset.

bin_width = classes[2] - classes[1]
q_hat_density = (q_hat/sum(q_hat))*(1/bin_width)

k = 80
from_bins = sample(x=c(1:length(histogram$counts)), size=k, replace=T, 
                     prob=q_hat_density/sum(q_hat_density))
new_values = rep(NA, k)
for (i in 1:length(from_bins)){
  bin = from_bins[i]
  new_values[i] = runif(1, histogram$breaks[bin], histogram$breaks[bin+1])
}
hist(new_values, prob=T, xlim=c(0, 1), ylim=c(0, 5), col='cornflowerblue', border='white',
     main='Histogram of the new dataset compared\nto the perturbed histogram', xlab='Bins')
plot(stepfun(histogram$breaks, c(0, q_hat_density, 0)), add=T, col='coral2', lwd=2, lty=2)
box()
grid()
legend('topright', c('Perturbed histogram', 'Sample from the\nperturbed histogram'), pch=15, cex=1.3, col=c('coral2', 'cornflowerblue'))

Let’s show main statistics comparing those concerning the perturbed histogram and privatized dataset sample itself. We can notice that the sample approximate quite well the perturbed histogram, but let’s check how many information we have lost.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.033   0.167   0.225   0.333   1.000
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.062   0.177   0.280   0.531   0.859

So in the end the new dataset will have dimension:

## [1] 80  1

Taking a look of what there is inside…

##   new_values
## 1 0.15164498
## 2 0.57715473
## 3 0.50994714
## 4 0.67650068
## 5 0.02090325
## 6 0.29925384
  1. (Bonus) Provide some evidence to support the claim that the perturbed histogram \(\hat q_{\epsilon, m}(·)\) in Equation 3 is indeed \(\epsilon\)-private as defined in Equation 1.

We decided to face the proof through a simulation that gives some evidence just for one \(\epsilon\), it is based on:

set.seed(69)

prova = function(j){
  tryCatch({
  X_samp = rbeta(n, 10, 10)
  new_X_samp = X_samp
  change = sample(1:n, 1)
  new_Xi = rbeta(1, 10, 10)
  new_X_samp[change] = new_Xi
  classes1 = seq(min(X_samp), max(X_samp), (max(X_samp) - min(X_samp)) / m)
  p_hat1 = hist(X_samp, breaks=classes1, plot=F)
  p_hat2 = hist(new_X_samp, breaks=classes1, plot=F)
  nu = rlaplace(m, 0, sqrt(4/(epsilon^2)))
  q_hat1 = p_hat1$counts + nu
  q_hat1[q_hat1 < 0] = 0
  q_hat2 = p_hat2$counts + nu
  q_hat2[q_hat2 < 0] = 0
  bin_width = classes1[2] - classes1[1]
  q_hat1_density = (q_hat1/sum(q_hat1))*(1/bin_width)
  q_hat2_density = (q_hat2/sum(q_hat2))*(1/bin_width)
  step_fun_q1 = stepfun(p_hat1$breaks, c(0, q_hat1_density, 0))
  step_fun_q2 = stepfun(p_hat2$breaks, c(0, q_hat2_density, 0))
  if (i==69 & j==69){
  plot(step_fun_q1, col='cornflowerblue',
       main='Stepfunctions of the simulations', lwd=3,
       xlab='x', ylab='Density', sub='Example with n=100, m=25 and epsilon=0.1')
  grid()
  box()
  plot(step_fun_q2, col='coral2', add=T, lwd=2, lty=3)
  legend('topright', c('q_hat1', 'q_hat2'), lty=c(1,3), lwd=3, col=c('cornflowerblue', 'coral2'))
  }
  div = q_hat1/q_hat2
  return (div)
  }, 
  error = function(e) NA)
}

for (i in 1:M2){
  for (j in 1:M){
    div2 = prova(j)
    division[j] = mean(div2[!is.na(div2) & !is.infinite(div2)])
  }
  condition[i] = max(division[!is.na(division) & !is.infinite(division)]) < exp(epsilon)
}

The condition was fulfilled 100 times out of 100 for an accuracy equal to 1.

## [1] 100
## [1] 1